Combining Quantitative and Logical Data Cleaning
Authors
Abstract
Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems, while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths. We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered to be a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Mover’s Distance (EMD). We show that the problem of computing a statistical distortion minimal repair is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has a small statistical distortion using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is co-NP-complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance that can be generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain a considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
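The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's repair algorithm: it only checks a single metric FD over one numeric attribute (values that agree on the left-hand side may differ by at most a threshold) and measures the distortion of a hand-picked candidate repair with a one-dimensional EMD over equal-sized samples. The function names, the `title`/`duration` attributes, the threshold of 5, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def metric_fd_violations(rows, lhs, rhs, delta):
    """Return the LHS groups whose numeric RHS values differ by more than delta.

    For one numeric attribute under absolute difference, every pair in a
    group is within delta exactly when max - min <= delta.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row[rhs])
    return {k: v for k, v in groups.items() if max(v) - min(v) > delta}


def emd_1d(xs, ys):
    """EMD between two equal-sized 1-D samples, each point carrying equal mass.

    In one dimension the optimal transport plan matches the sorted values
    in order, so the distance is the mean absolute difference of the pairs.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical data: the same title should have roughly the same duration.
rows = [
    {"title": "A", "duration": 120},
    {"title": "A", "duration": 122},  # within 5 of 120: not a violation
    {"title": "B", "duration": 90},
    {"title": "B", "duration": 140},  # 50 away from 90: a violation
]

violations = metric_fd_violations(rows, ["title"], "duration", delta=5)

# One candidate repair (an assumption): overwrite the 140 with 90.
original = [r["duration"] for r in rows]
repaired = [120, 122, 90, 90]
distortion = emd_1d(original, repaired)
```

A repair procedure in the spirit of the abstract would compare several candidate repairs and prefer one with a small distortion; the sketch only scores a single candidate.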
∗Supported in part by NSERC BIN. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 4. Copyright 2015 VLDB Endowment 2150-8097/15/12.
Similar resources
Wisteria: Nurturing Scalable Data Cleaning Infrastructure
Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. Wh...
Arnold: Declarative Crowd-Machine Data Integration
The availability of rich data from sources such as the World Wide Web, social media, and sensor streams is giving rise to a range of applications that rely on a clean, consistent, and integrated database built over these sources. Human input, or crowd-sourcing, is an effective tool to help produce such high-quality data. It is infeasible, however, to involve humans at every step of the data cle...
How to Be Both Rich and Happy: Combining Quantitative and Qualitative Strategic Reasoning
We propose a logical framework combining a game-theoretic study of abilities of agents to achieve quantitative objectives in multi-player games by optimizing payoffs or preferences on outcomes with a logical analysis of the abilities of players for achieving qualitative objectives of players, i.e., reaching or maintaining game states with desired properties. We enrich concurrent game models wit...
How to Be Both Rich and Happy: Combining Quantitative and Qualitative Strategic Reasoning about Multi-Player Games (Extended Abstract)
We propose a logical framework combining a game-theoretic study of abilities of agents to achieve quantitative objectives in multi-player games by optimizing payoffs or preferences on outcomes with a logical analysis of the abilities of players for achieving qualitative objectives of players, i.e., reaching or maintaining game states with desired properties. We enrich concurrent game models wit...
Large-scale ultrasonic cleaning system: Design of a multi-transducer device for boat cleaning (20kHz).
The present study is part of a global project which consists in the development of an automatic cleaning station for immersed boats (cockle, ninepin, etc.) in a self-service mode, associating an innovative ultrasonic device for cleaning with a specific water treatment. The originality of the process is that cleaning is performed by three transducers operating simultaneously at low frequency and...
Journal title:
- PVLDB
Volume 9, Issue
Pages -
Publication date: 2015